The long-term preservation of Web content

Authors

  • Michael Day
  • Julien Masanès
Abstract

Web archiving initiatives exist to collect ephemeral Web content for use by current and future generations of users. To date, most such initiatives have concentrated on the development of strategies and software tools for the collection of Web content and for providing current access to this content through interfaces like the Internet Archive's Wayback Machine. The International Internet Preservation Consortium (IIPC) is currently building on this legacy with the collaborative development of a set of tools that can be used for the capture of Web sites and for the navigation and searching of Web archives. The focus on collection strategies and tools is a response to what is perhaps the most significant challenge of the Web from an information management perspective. Its dynamic nature means that pages, sites and even whole domains are continually evolving or disappearing. It is difficult to get accurate and up-to-date statistics on Web page longevity, but a range of studies hint at the ultra dynamic nature of the Web. A study by Lawrence, et al. (2001) cited an Alexa Internet estimate that pages disappeared on average after 75 days. Longitudinal studies of Web page persistence by Koehler (2004) found that just 33.8 per cent of a sample set of pages selected in December 1996 persisted at their original URLs by May 2003. Studies of the longevity of Web references in scientific journals show similar trends. For example, a 2003 study of Internet ci-

Similar resources

A Mets Based Information Package For Long Term Accessibility Of Web Archives

The British Library’s web archive comprises several terabytes of harvested websites. Like other content streams, this data should be ingested into the library’s central preservation repository. The repository requires a standardized Submission and Archival Information Package. Harvested websites are stored in Archival Information Packages (AIP). Each AIP is described by a METS file. Operational me...

Turning pure Web Page Storages into Living Web Archives

Web content plays an increasingly important role in the knowledge-based society, and the preservation and long-term accessibility of Web history has high value (e.g., for scholarly studies, market analyses, intellectual property disputes, etc.). There is strongly growing interest in its preservation by libraries and archival organizations as well as emerging industrial services. Web content cha...

The Effect of Using Word Clouds on EFL Students’ Long- Term Vocabulary Retention

Vocabulary is an important component in all four skills of language. The issue of vocabulary retention is of great importance to EFL teachers in instructional contexts because they always ...

First Results on Detecting Term Evolutions

The archival of content like publications or web pages is just the first step toward “full” content preservation. It also has to be guaranteed that content can be found and interpreted in the long run. The correspondence between the terminology used for querying and the one used in the content objects to be retrieved is a crucial prerequisite for effective retrieval technology. However, a...

Towards smart storage for repository preservation services

The move to digital is being accompanied by a huge rise in volumes of (born-digital) content and data. As a result the curation lifecycle has to be redrawn. Processes such as selection and evaluation for preservation have to be driven by automation. Manual processes will not scale, and the traditional signifiers and selection criteria in older formats, such as print publication, are changing. T...

Migrating Content in WARC Files

Heritage institutions all over the world have started harvesting and preserving resources of the World Wide Web for future generations as part of our cultural heritage. This task tends to be a non-trivial one because of two complex challenges: (1) crawling the enormous amount of data on the Internet and (2) performing long-term preservation strategies on these data. Nowadays a lot of effort i...

Publication date: 2006